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Abstract. We use a graphics processing unit (GPU) for fast computations of Monte Carlo integrations. 
Two widely used Monte Carlo integration programs, VEGAS and BASES, are parallelized on GPU. By 
using W + plus multi-gluon production processes at LHC, we test integrated cross sections and execution 
time for programs in FORTRAN and C on CPU and those on GPU. Integrated results agree with each 
other within statistical errors. Execution time of programs on GPU run about 50 times faster than those 
in C, and more than 60 times faster than the original FORTRAN programs. 
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1 Introduction 

GPU (Graphics Processing Unit) is originally developed 
for fast output of moving complex images onto computer 
displays. Because it is composed of many multi-processors, 
it can be used as a powerful parallel processor not only for 
graphics applications but also for general purpose compu- 
tations. GPU has already been used in scientific appli- 
cations which require a huge number of calculations to 
process data as in astrophysics and fluid dynamics. Also 
in elementary particle physics, successful computations of 
various cross sections on GPU have been reported in [TJ 
[2] . In these studies programs on GPU were shown to run 
about 100 times faster than those on CPU. 

Two-orders of magnitude reduction of computation time 
by GPU demonstrated in the previous studies should greatly 
improve the efficiency of analysis in the field of elementary 
particle physics. In this paper, we show that general pur- 
pose Monte Carlo integration programs can be adopted 
to run on GPU, opening the door of fast and economi- 
cal computation to all area of research that makes use of 
Monte Carlo method. 
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Scattering amplitudes of physics processes at LHC ener- 
gies are expressed as complex functions of momenta and 
helicities of external particles, and kinematical distribu- 
tions of produced particles are obtained by integrals of the 
squared amplitudes over many-body phase space. Because 
integration of multi-dimensional function is most conve- 
niently done with Monte Carlo integration technique, the 
method is widely used in the field of elementary particle 
physics. They are especially useful when evaluating differ- 
ential cross sections with experimental cuts on produced 
particle momenta. 



As the number of final state particles increases, the 
computation time which is necessary to obtain good ac- 
curacy of the integrated results grows quickly. It is partly 
because the number of sampling points during the integra- 
tion should be large for higher dimensional integral, and 
partly because the computation time for the scattering 
amplitudes also increases as the number of external parti- 
cles grows. Therefore, integration of differential cross sec- 
tion with good accuracy becomes a very time consuming 
task for multi-particle productions processes. Significant 
reduction of computation time by the use of GPU will 
contribute to the improvement of the efficiency of physics 
analysis at LHC and elsewhere. 

VEGAS 3 and its variants are widely used for Monte 
Carlo integration. They are based on an iterative and 
adaptive Monte Carlo scheme. In these programs each axis 
of variable is divided into grids, thus the integrand vol- 
ume is divided into hyper cubes. Monte Carlo integration 
is performed in each hypercube and variances from hyper- 
cubes are used to define new grid spacings which are used 
in the next iteration step. The variance of total integral is 
reduced by iteration by iteration. BASES [I] is one of its 
variants developed at KEK, which has been widely used 
in particle physics at colliders. 

In this paper we study parallelization of VEGAS and 
BASES, by using GPU. 



3 Parallelization of Monte Carlo integration 
program 

3.1 Program structure 

Multi-dimensional integration programs, VEGAS and BASES, 
have the following common structure: 
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1. initialize parameters, 
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2. generate a space point within a fc-dimensional hyper- 
cube from a set of k random numbers, 

3. compute an integrand function at the generated space 
point, 

4. accumulate function values, 

5. optimize grid spacing after accumulating N function 
values, 

6. repeat 2-5 steps up to M iterations or until the desired 
accuracy is reached. 

In BASES, after M iterations (grid optimization phase) 
are done, further iteration steps are executed in order 
to improve the accuracy of the integration (integration 
phase) . The results of this integration phase are used for 
event generations by SPRING [J]. 

We measure fractions of CPU time for each step and 
find that almost 98-99% of total CPU time is used in the 
step 3 where integrand function is computed. This frac- 
tion grows as the number of sampling points grows and 
the complexity of the integrand function grows. There- 
fore significant reduction of total CPU time is expected 
by parallelizing function calls at all sampling points with 
GPU. 

3.2 Program conversion 

Both VEGAS and BASES are originally written in FOR- 
TRAN. In order to transfer function calls to GPU, they 
should be written in CUDA [5], C/C++ style platform 
developed for general purpose computing on GPU. We 
first convert the FORTRAN programs into C codes. Then 
we transform the function call part further into CUDA 
codes. Due to the limited support for double precision 
computation capabilities of the GPU which we use for 
this study [T], floating point computations in the GPU 
programs are done in single precisiorQ- We check results 
and performances of all programs of FORTRAN, C and 
GPU versions. 

In the programs of GPU version, all sampling points 
are generated on GPU and integrand function values at 
each space point is computed in parallel (steps 2-3). Then 
computed function values are transferred to CPU memo- 
ries. At the CPU side, computed function values are ac- 
cumulated and grid parameters are optimized based on 
the accumulated information (steps 4-5). These steps are 
iterated and variance of integral are reduced. 

4 Computing environments 
4.1 GPU and its host PC 

We use a GeForce GTX285 by NVIDIA [JJ for the com- 
putation of cross sections of physics processes with Monte 
Carlo integration. The GeForce GTX285 which is con- 
nected with PCI Express2xl6 bus has 30 streaming multi- 
processors (SM). Since each SM has 8 streaming proces- 
sors (SP), the GTX285 GPU card has 240 SP in total. 

1 This limitation is relaxed for NVIDIA's GPUs with newer 
architecture [6]. 



Other parameters of the GTX285 are summarized in Ta- 
bled] 

The GTX285 is controlled by Linux PC with FedoralO 
(64bit) operating system. The parameters of host com- 
puter is summarized in Table. [2l 

In order to compile programs of GPU version, we use 
the CUDA version 2.3 toolkit which are obtained from the 
NVIDIA site [7J. And for the programs in FORTRAN and 
C, we use gfortran and gec which is automatically installed 
with Fedora 10. The version of compilers are summarized 
in Table. H 



Table 1. Parameters of GTX285 



Number of 
multiprocessor 


30 


Number of core 


240 


Total amount of 
global memory 


2GB 


Total amount of 
constant memory 


64kB 


Total amount of shared 
memory per block 


16kB 


Total number of registers 
available per bloc 


16kB 


Clock rate 


1.48GHz 



Table 2. Host PC environment 





CPU 


Core i7 2.67GHz 






L2 Cache 


8MB 






Memory 


6GB 






Bus Speed 


1.333GHz 






OS 


Fedora 10 (64 bit) 




Table 3. development environment 


nvee 


Rel. 2.3 (VO.2.1221) 


CUDA Driver 


Ver.2.30 


CUDA Runtime 


Ver 2.30 


gec 


4.3.2 (Red Hat 4.3.2-7) 


gfortran 


4.3.2 (Red Hat 4.3.2-7) 



4.2 Process time measurement 

For comparisons of execution time, we measure the time 
between the start of VEGAS/BASES programs and the 
end of them, i.e. between the step 1 and the completion of 
the step 6, including the steps 4 and 5 that are processed 
on CPU. For FORTRAN programs, an intrinsic procedure 
of gfortran, "cpu_time", is used for the measurement of 
elapsed CPU time. For C and GPU programs, a system 
call, "getrusage", is used for the time measurements. 
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5 Physics process 

In order to test the GPU version of VEGAS and BASES, 
called g VEGAS and gBASEffl respectively, we compare 
total cross sections of multi-particle production process 
at LHC. In particular, we report results on the following 
processes 

ud W + (-^ li + v,j) + n gluons (n = ~ 4) (1) 

with semi-realistic final state cuts at LHC. The dimension 
of integral is 3(n+2)— 4 from the phase space, 2 from the 
parton distributions (PDF), and 1 for the helicity summa- 
tion, and hence 3n+5; hence from 5-dimensional integral 
for no gluon (n — 0) to 17-dimensional integral for 4 gluons 
(n = 4). 

The degree of the complexity (length) of the integral 
function can be estimated from the number of contributing 
Feynman diagrams and the number of independent color- 
basis vectors as listed in Table|4] Previous studies [T| show 
that the performance of GPU computation is limited by 
the product of these two numbers, the processes eq. ((TJ 
cover program size of four orders of magnitude difference. 

In order to simulate realistic LHC experiments, We 
introduce the following final state cuts. For gluons, 

\r)i\ < 5, (2a) 
p Ti > 20 GeV, (2b) 
p Ttj > 20 GeV, (2c) 

where r\i and pTi are the rapidity and the transverse mo- 
mentum of the i-th jet, respectively, in the pp collisions 
rest frame along the right- moving (p z = \p\) proton mo- 
mentum direction, and PTij is the relative transverse mo- 
mentum [5] between the jets i and j defined by 

PTij = min(p Ti ,p Tj ) ARij, (3a) 
AR l3 = y/Ariij+Atij. (3b) 

Here ARij measures the boost-invariant angular sepa- 
ration between jets. For ji + from W + decay, we require 



\ m \ < 2.5, (4a) 
PTi > 20 GeV (4b) 



As for the parton distribution function (PDF) , we use 
the set CTEQ6L1 9. and the factorization scale is chosen 
to be the Z boson mass. The QCD coupling constant is 
also fixed as a s ( m z)is = 0.118 [ID) . 

For the computation of helicity amplitudes of these 
processes, HELAS [IT] for FORTRAN programs and its 
C/GPU version, HEGET [1; are used. 



2 Sample source codes of gVEGAS are available on the web 
page: http : //madgraph . kek . jp/KEK/GPU/gVEGAS/example/ 



Table 4. ud — >■ W + (— > H + Vn) + gluons 



Number of 


Number of 


Number of 


gluons 


diagrams 


color bases 





1 


1 


1 


2 


1 


2 


8 


2 


3 


54 


6 


4 


516 


24 



Table 5. Parameters for integrations 



Number of 
gluons 


NCALL 


ITMX 


ITMX1 


ITMX2 





10 6 


10 


5 


5 


1 


10 6 


10 


5 


5 


2 


10 6 


10 


5 


5 


3 


10 7 


10 


5 


5 


4 


10 7 


10 


5 


5 



6 Results 

6.1 Parameters of the integration programs 

In order to control the behavior of the Monte Carlo inte- 
gration by VEGAS and BASES, user can give them the 
following parameters: 

— number of total function calls in one iteration step 
(NCALL), 

— number of maximum iteration steps (ITMX), and 

— desired accuracy of the integration (ACC). 

NCALL is the number N in step 5 and ITMX is the number 
M of the step 6 in Section 13.11 Iteration steps of BASES 
are separated into two phases: the grid optimization step 
and the integration step. Accordingly, ITMX and ACC are 
also separated as: 

— number of maximum iteration steps (ITMX1), and 

— desired accuracy of integration (ACC1) 

for the grid optimization phase, and 

— number of maximum iteration steps (ITMX2), and 

— desired accuracy of integration (ACC2) 

for the integration phase. 

Parameter values used in this study are summarized in 
Table [5] In order to keep the total amount of computations 
to be the same among all the programs, all desired accu- 
racies, ACC for VEGAS and ACC1 and ACC2, are set to an 
extremely small value (0.001%) which cannot be reached 
by MC sampling of NCALLxITMX points used in this study: 
see Table [5] For BASES, numbers of iteration steps for 
the grid optimization and integration phases are set to be 
equal (ITMX1 = ITMX2), and their sum is set the same as 
ITMX of VEGAS programs (ITMX1+ITMX2=ITMX). In sum- 
mary, we accumulate 10 7 sample points for processes up 
to two gluons (n = 0, 1,2) and 10 s points for those with 
more gluons (n = 3 and 4). 
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No. of 
gluons 


VEGAS 


BASES 


MG/ME 


[fb] 


FORTRAN 


C 


GPU 


FORTRAN 


C 


GPU 





2.137±0.001 


2.138±0.001 


2.137±0.001 


2.137±0.001 


2.137±0.001 


2.137±0.001 


2.138±0.002 


xlO 6 


1 


1.783±0.001 


1.783±0.001 


1.780±0.001 


1.785±0.001 


1.784±0.001 


1.782±0.001 


1.773±0.003 


xlO 5 


2 


1.873±0.007 


1.853±0.006 


1.843±0.006 


1.876±0.007 


1.883±0.010 


1.870±0.007 


1.874±0.002 


xlO 4 


3 


2.868±0.008 


2.881±0.009 


2.832±0.010 


2.860±0.010 


2.855±0.014 


2.907±0.012 


2.845±0.005 


xlO 3 


4 


6.186±0.041 


6.054±0.081 


6.157±0.073 


6.078±0.134 


6.191±0.068 


6.385±0.235 


6.070±0.010 


xlO 2 



Table 6. Total cross sections of ud — > 14 7+ (— ¥/j, + v m ,) + ro-gluons 
computed by programs in FORTRAN, C, CUDA (GPU) and MadGraph/MadEvent. 



6.2 Total cross section computation 

Total cross sections for processes in eq. (UJ) with exper- 
imental cuts (eqs. [2E]) are listed in Table [S] They are 
computed with programs in FORTRAN, C and CUDA 
(GPU). Cross sections from different of programs agree 
to each other within their statistical errors. In addition, 
they agree with the results from the event generator Mad- 
Graph/MadEvent [mill]. 



6.3 Parameters of the kernel program 

The performance of GPU programs largely depends on 
parameters of kernel programs executed on GPU. Most 
significant parameters which affect the process time of 
programs are: 

— number of registers allocated to a thread, and 

— number of threads in a thread block. 

Details of kernel parameters are explained in pQ. In this 
study we use 64 as a number of registers allocated a thread 
and 256 as a number of threads in a thread block. From 
the detailed study of dependence of performance on these 
parameters we find that they give almost the best perfor- 
mance for all processes in this study. 

Number of thread blocks in a grid (= a set of thread 
blocks), which is executed with a single kernel call, are set 
to be equal to NCALL, so that one iteration of Monte Carlo 
integration steps is executed by a single kernel call. 



6.4 Process time comparisons 

In Table [7] measured process time for a single function call 
is listed for all programs. As explained above, the process 
time per single function call is obtained by dividing the 
total computation time by 10 7 for processes with up to 
two gluons (n — 0,1,2) and by 10 s for those with more 
gluons (n — 3 and 4) . 

Numbers in parentheses in the FORTRAN and C columns 
in Table [7] are the ratio of the process time as compared 
to that of GPU. About a factor of 50 times more sam- 
pling is possible with GPU as compared to the C pro- 
grams on CPU. During the comparison of process time, 
we find that the original FORTRAN codes run slower 
than the C-version. Because the total process time for 



these CPU programs is dominated by the function (am- 
plitude) computation, this FORTRAN-to-C ratio can be 
originated from the difference of handling complex num- 
bers which appears in amplitude computations. We use 
in-line functions for the computations of complex num- 
bers in C, which might have better efficiency compared 
with built-in complex functions in FORTRAN. 

In Fig. [1] process time for a single function call is plot- 
ted against the number of gluons in the final state. And 
in Fig. [5]ratios of process time between programs on CPU 
(FORTRAN/C) and those on GPU are plotted. Differ- 
ences of process time between VEGAS and BASES are 
small. Programs which are executed on GPU can run 
about 50 times faster than those in C. Compared with 
original FORTRAN version programs, the differences of 
performance become larger. 

When the final state has 4 gluons, the size of GPU 
program becomes large and requires more access to local 
memories. From previous studies on performances of GPU 
programs [1 , larger programs show worse performance on 
execution time. Still the VEGAS(BASES) programs for 
the 4 gluon production process runs 40 (34) times faster 
on GPU than the C-program runs on CPU. 



gio 3 

tfl 

3io 2 



o 



ud -> W* + gluons 




1 2 3 

Number of Gluons in Final State 

Fig. 1. Process time of a single function call for ud — >W + {- 
/i + ^ M ) + Ji-gluons. 
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No. of gluons 


VEGAS [/usee] 


BASES [fisec] 


FORTRAN 


C 


GPU 


FORTRAN 


C 


GPU 





1.32 (63.8) 


1.06 (51.2) 


0.0207 


1.78 (68.7) 


1.39 (53.5) 


0.0260 


1 


2.19 (68.8) 


1.73 (54.6) 


0.0318 


2.97 (75.0) 


2.24 (56.6) 


0.0396 


2 


4.19 (84.2) 


2.96 (59.5) 


0.0497 


4.97 (88.3) 


3.35 (59.6) 


0.0563 


3 


11.1 (101) 


7.00 (63.6) 


0.110 


11.7 (103) 


7.02 (62.2) 


0.113 


4 


72.1 (77.8) 


37.4 (40.4) 


0.927 


61.6 (66.2) 


31.8 (34.2) 


0.931 



Table 7. Process time for a single function call in VEGAS and BASES on CPU with FORTRAN or C, and on GPU with 
CUDA. Numbers in the parentheses of the FORTRAN and C columns are the ratio of process time relative to that of GPU. 







- VEGAS 


FORTRAN 




-A- 


- BASES 


















c 4 


1 






i 



Number of Gluons in Final State 



Fig. 2. Process time ratios of FORTRAN and C programs to 
the corresponding GPU program 



Appendix A Sample codes for gVEGAS 

Sample source codes of the gVEGAS program are avail- 
able from the web page: 

http : //madgraph . kek . j p/KEK/ GPU/ gVEGAS/example/ . 

They include a minimum set of source files which are nec- 
essary to do Monte Carlo integration with the VEGAS 
algorithm on GPU, but do not include Makefile which 
largely depends on user's environment of development. 



Appendix A.l User programs 

Sample codes include two user programs: gVegasMain. cu 
and gVegasFunc. They should be customized by user to 
the task one intends to perform. 



7 Summary 



Based on VEGAS and BASES programs written in FOR- 
TRAN, we have developed Monte Carlo integration pro- 
grams, gVEGAS and gBASES respectively, which can be 
executed on NVIDIA's GPU using the CUDA develop- 
ment kit. We have tested their performance with the com- 
putation of total cross sections of processes, ud— > W + (— >• 
+ n-gluons (n = ~ 4), in pp collisions at \fs = 
14TeV. Total cross sections agree with each other within 
statistical errors for all programs. Both VEGAS and BASES 
programs on GPU run about 50 times faster than the same 
programs written in C, which are converted from the orig- 
inal FORTRAN version programs. Compared with FOR- 
TRAN programs their GPU version programs show more 
than 60 times better performance in execution time. For 
the process with 4 gluons, the size of GPU programs be- 
comes large and their relative performance become worse 
than small programs. 
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Appendix A. 1.1 gVegasMain. cu 

gVegasMain . cu includes a sample main program for Monte 
Carlo integration where user can set parameters for the 
integration. Typical parameters are: 

nBlockSize size of a thread block of a kernel program 
on GPU 

ndim number of independent variables of integrand 

function 

ncall number of sample points per iteration 

itmx maximum number of iterations 

acc required accuracy during iterations 

All these parameters can be set within gVegasMain . cu. 



Appendix A. 1.2 gVegasFunc . cu 

User function program integrated in the program is de- 
scribed in gVegasFunc . cu. The calling sequence of user 
functions is 

float func(float* rx, float wgt) 

where rx includes a set of variables and wgt is a function 
weight. 



6 



J. Kanzaki: Fast Monte Carlo integration on GPU 



Appendix A. 2 Internal programs 

The gVEGAS consists of the following programs which 
are included in the sample codes: 

gVegas . cu main program of gVEGAS system 
gVegasCallFunction . cu 

kernel program which runs on GPU called 

from gVegas . cu. 
xorshif t . cu random number generator on GPU 

Appendix A. 3 Header files 

The following header files which are necessary for the gVE- 
GAS system are also included in the sample codes: 

gvegas.h includes nBlockSize which user 
can set in gVegasMain . cu 

vegasconst . h includes internal constants which 
are located at constant memory of 
GPU 

vegas . h includes internal gVEGAS parameters 

kernels . h a list of kernel programs which are 
included at CUDA compilation. 
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